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Abstract 

We present a novel, highly efficient algorithm to parallelize 0(A^^)direct summa- 
tion method for A^-body problems with individual timesteps on distributed-memory 
parallel machines such as Beowulf clusters. Previously known algorithms, in which 
all processors have complete copies of the A'^-body system, has the serious problem 
that the communication-computation ratio increases as we increase the number of 
processors, since the communication cost is independent of the number of proces- 
sors. In the new algorithm, p processors are organized as a y/p x yjp two-dimensional 
array. Each processor has N/^ particles, but the data are distributed in such a 
way that complete system is presented if we look at any row or column consisting 
of y/p processors. In this algorithm, the communication cost scales as N/^Jp, while 
the calculation cost scales as N'^/p. Thus, we can use a much larger number of 
processors without losing efficiency compared to what was practical with previously 
known algorithms. 

PACS: 02.60. Ch;95. 10. Ce; 98.10.-hz 
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1 Introduction 

In this paper we present a novel algorithm to parallelize the direct sum- 
mation method for astrophysical A^-body problems, either with and with- 
out the individual timestep algorithm. The proposed algorithm works also 
with the Ahmad-Cohen neighbor scheme (Ahmad and Cohen 1973), or with 
GRAPE special-purpose computers for iV-body problems (Sugimoto et al. 



Preprint submitted to Elsevier Preprint 



1 February 2008 



1990, Makino and Taiji 1998). Our algorithm is designed to offer better scal- 
ing of the communication-computation ratio on distributed-memory multi- 
computers such as Beowulf PC clusters (Sterling et al. 1999) compared to 
traditional algorithms. 

This paper will be organized as follows. In section 2 we describe the traditional 
algorithms to parallelize direct summation method on distributed-memory 
parallel computers, and the scaling of communication time and computational 
time as functions of the number of particles N and number of processor p. 
It will be shown that for previously known algorithms the calculation time 
scales as 0{N'^/p), while communication time is 0{N + logp). Thus, even 
with infinite number of processors the total time per timestep is still 0{N), 
and wc cannot use more than 0{N) processors without losing efficiency. 0{N) 
sounds large, but the coefficient is rather small. Thus, it was not practical to 
use more than 10 processors for systems with a few thousand particles, on 
typical Beowulf clusters. 

In section 3 we describe the basic idea of our new algorithm. It will be shown 
that in this algorithm the communication time is 0{N/^ + logp). Thus, we 
can use O(iV^) processors without losing efficiency. This implies a large gain 
in speed for relatively small number of particles such as a few thousands. We 
also briefly discuss the relation between our new algorithm and the hyper- 
systolic algorithm (Lippert et al. 1998). In short, though the ideas behind the 
two algorithms are very different, the actual communication patterns are quite 
similar, and therefore the performance is also similar for the two algorithms. 
Our algorithm shows a better scaling and also is much easier to extend to 
individual timestep and Ahmad- Cohen schemes. 

In section 4 we discuss the combination of our proposed algorithm and individ- 
ual timestep algorithm and the Ahmad-Cohen scheme. In section 5, we present 
examples of estimated performance. In section 6 we discuss the combination 
of our algorithm with GRAPE hardwares. In section 7 we sum up. 



2 Traditional approaches 



The parallelization of the direct method has been regarded simple and straight- 
forward [see, for example, (Fox et al. 1994)]. However, it is only so ii N » p 
and if we use simple shared-timestep method. In this section, we ffist discuss 
the communication-calculation ratio of previously known algorithms for the 
shared timestep method, and then those for individual timestep algorithm 
with and without the Ahmad-Cohen scheme. 
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2.1 Shared timestep 



Most of the textbooks and papers discuss the ring algorithm. Suppose we 
calculate the force on N particles using p processors. We connect the processors 
in a one dimensional ring, and distribute particles so that each processor 
has N/p particles (figure 1). Here and hereafter, we assume that N is integer 
multiple of p, to simplify the discussion. 

The ring algorithm calculates the forces on N particles in the following steps. 

(1) Each processor calculates the interactions between N/p particles within 
it. Calculation cost of this step is Cf{N/pY /2, where C/ is the time to 
calculate interaction between one pair of particles. 

(2) Each processor sends all of its particles to the same direction. Here we call 
that direction "right" . Thus all processors sends its particles to their right 
neighbors. The communication cost is CcN/p + C^, where Cc is the time 
to send one particle to the neighboring processor and Cs is the startup 
time for communication. 

(3) Each processor accumulates the force from particles they received to its 
own particles. Calculation cost is C/(iV/p)^. If force from all particles is 
accumulated, go to step 5. 

(4) Each processor then sends the particles it received in the previous step 
to its right neighbor, and goes back to previous step. 

(5) Force calculation completed. 

The time for actual calculation is given by 

7f,Hng = CfN^/p, (1) 

and the communication time 

Tc,nng = CeiV + C,p. (2) 

The total time per one timestep of this algorithm is 

Tnng = 7f,Hng + T^.^ng = CfN^/p + C,N + C,p. (3) 



Here, we neglect small correction factors of order 0{l/p). 

For fixed number of particles, the calculation cost (first term in equation 3) 
scales as 1 /p while communication cost increases. Therefore, for large p we see 
the decrease in the efficiency. Here we define efficiency as 

V^ing -^f,ring/-^ring) (4) 
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Fig. 1. The ring algorithm on 4 processors for 8 particles. In stage each processor 
calculate the interactions between its particles. In stage 1 each processor sends its 
share to its right neighbor, and calculates the force from particles it received to 
particles it has. In stages 2 and 3 (stage 3 omitted from the figure), each processor 
sends what it received in the previous stage. Force calculation completes at stage 3. 

which reduces to 

'^""^ " l + C,p/{CfN)+CsP^/{CfN^y 

Thus, to achieve the efficiency better than 50%, the number of processor p 
must be smaller than 

N 



Phalf,nng = ^(V^c' + sC f - C,). (6) 



Equation (6) can be simplified in the two limiting cases 

Phalf,ring ~ | N^Cf/Cs {Cl « CsCf) 



(7) 



In most of distributed-memory multicomputers, Cc >> Cj. For example, with 
a 1 Gflops processor, we have C/ ~ 3 x 10~®sec. If this processor is connected to 
other processor with the communication link of the effective speed of lOMB/s, 
Cc ~ 3 X lO^^sec. The value of Cs varies depending on both networking 
hardware and software. Table 1 gives the order-of-magnitude values for these 
coefficients for several platforms. 
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Table 1 

Time coefficients in seconds 



Network 




Cf 


Cs 






\JCf/Cs 


Cf/Cc 


Fast Ethernet 


3 


X 10-^ 


10-4 


3 


X 10"^ 


0.017 


0.01 


Gigabit Ethernet 


3 


X 10"^ 


10-4 


3 


X 10"^ 


0.017 


0.1 


Myrinet 


3 


X 10"^ 


10-5 


3 


X IQ-'^ 


0.05 


0.1 



With the Fast Ethernet, Phaif,ring is A^/100. Faster networks help to increase 
Phaif,ringj but we Can see that even with a very low- latency network system 
like Myrinet, Phaif,rmg is latency-limited. Fast, high-latency network such as 
Gigabit Ethernet does not offer much improvement over Fast Ethernet. 

A possible alternative of the ring algorithm is the copy algorithm (figure 2), 
in which all processors have complete copy of the system at the beginning of 
the timestep. Then, each processor calculates the forces on its share of N/p 
particles, and integrates their orbits. After the orbit integration is done, the 
updated particles in each processor must be broadcasted to all other pro- 
cessors. If we use a simple ring algorithm to broadcast the data, the scaling 
characteristic of this algorithm is exactly the same as that of the ring algo- 
rithm. 

If the communication network has the connectivity better than 1-D ring, we 
can use the so-called message-combining technique, which is essentially the 
same algorithm as what is used for global summation. Assume that we have 
2^ processors with fully connected network. At the first stage, each processor 
sends its data to its right neighbor. At the second stage, each processor sends 
both its original data and the data it received at the first stage to its second 
right neighbor, and in the third stage to fourth neighbor. In this way, all 
processors receive the data of all other processors after s = log2P stages. 
Note that the amount of the data received is the same as the ring algorithm. 
Only the startup overhead is reduced. With this implementation of the copy 
algorithm, the total time becomes 

Tcopy = Tf,Hng + T.^ring = CfN^p + C,N + \og^ p. (8) 

In this case, we can almost always ignore the contribution of the last term, 
and the number of processors with 50% efficiency is given by 

Phalf,copy ~ 

NCfIC,. (9) 

Thus, we can use a larger number of processors, as far as the memory of 
individual node is large enough to keep the copy of the complete system. We 
could also implement some hybrid of the ring and copy algorithm to reduce 
the memory requirement. 
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Fig. 2. The copy algorithm on 4 processors for 8 particles. In stage each processor 
calculates the forces on its share of particles and integrates their orbits. In stage 1, 
processor broadcasts its share of particles to all other processors. In the following 
stages (not shown in the figure), each processor broadcasts its share in turn. 



2.2 Individual timestep 



Here, what we actually consider is the so-called blockstep method (McMillan 
1986, Makino 1991), where we organize the timesteps of particles so that the 
number of particles share exactly the same time is becomes maximum. This is 
clearly desirable to achieve high efficiency on parallel computers. Wc denote 
the average number of particles which share the same time as n. According to 
a simple theoretical estimate, n oc N'^^^ if the central density of the system is 
not very high (Makino 1991). 

It's not easy to extend the ring algorithm so that it can handle the block- 
step algorithm. A naive approach is just to pass around all particles in the 
system in the same way as in the case of the shared timestep. In this case, 
calculation cost is CfNn/p while communication cost is CcN. Thus, relatively 
speaking communication becomes more expensive by a factor of N/n, which is 
essentially the gain of calculation speed by the individual timestep algorithm. 

The copy algorithm is much easier to extend to the blockstep method. At each 
blockstep, each processor determines which particles it will update, so that it 
updates n/p particles which are not overlapped with particles updated by 
other processors. Then they calculate the forces on them and integrate their 
orbits. At the end of the timestep, they broadcast the updated particles. The 
calculation cost is CfNn/p, while the communication cost is CcTi + Cslog2p. 
Thus, the communication is reduced by the same factor as that for the calcu- 
lation cost. 
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Note, however, that the effect of the startup overhead of communication is 
increased. We can still ignore the startup time if 

logapC, < CeTl, (10) 

which is okay except for the case of the Gigabit ethernet. Thus, individual 
timestep does not change the scalability of the algorithm. 

We cannot easily use more than n processors, since each processor should have 
at least one particles to integrate. At least under the assumption of n = A^^/^, 
Phaif,copy becomes larger than n for A" > 1000, for the case of Myrinet. In 
this case, we can distribute the work to calculate the force on one particle to 
several processors. This of course increases the communication cost, but the 
increase is around Ccp/n which is always smaller than CcU. 

In theory, it is possible to extend the ring algorithm in the following way. 
Instead of passing around the particles which exert the forces, we can pass 
around the particles which receives the force. In this way, the ring algorithm 
can achieve the same scaling as the copy algorithm. The naive use of the ring 
communication pattern again incurs the high startup cost. So it is necessary 
to use the message combining technique. 

The ring algorithm suffers a potential load imbalance problem, since particles 
are assigned to fixed processors. At any given blockstcp, there is no guarantee 
that the number of particles to be integrated is balanced. Of course, we can 
try to migrate particles with small timestep from heavily loaded processors 
to lightly loaded processors, and vice versa, to reduce the imbalance. The 
effectiveness of such an algorithm is an open question, though we expect the 
load balancing would probably work fine. 

For the individual timestep method, however, there is no reason to prefer 
the (modified) ring algorithm over the copy algorithm, since the efficiency of 
the ring algorithm at the best only matches that of the copy algorithm. The 
potential advantage of the ring algorithm is that it requires less memory, since 
the particles are distributed over processors. However, since the number of 
particles we consider here is relatively small, the memory requirement is not 
very important. 

2.3 The Ahmad- Cohen scheme 

The copy algorithm is problematic to extend to the Ahmad-Cohen scheme. 
The problem is that informations to be passed around increases since now 
each particle has its own neighbor list. This list must be communicated to all 
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processors. On the other hand, the calculation cost is further reduced. 

The ring algorithm is better than the copy algorithm here, since with the 
ring algorithm we do not have to pass around the neighbor list. Each proces- 
sor maintains partial neighbor lists for all particles, which contains only the 
particles on that processor. The scaling relation of calculation and communi- 
cation costs are essentially the same as that of the individual timcstcp, except 
that now total number of particles N is replaced with average number of force 
calculations per timestep. Theoretically, it is of the order of A^^/^(Makino and 
Hut 1988). This means we can now use only N^/'^Cf/Cc processors. Even 
with Myrinet, this number is rather small. For example, we can use only 100 
processors for N = 10^. 

2.4 Summary 

We overviewed known algorithms to parallelize the direct force calculation. All 
algorithms share the same problem that the relative cost of the communication 
goes up as we increase the number of processors. More troublesome is that 

even for fixed N and p, the relative cost of communication is higher for a 
more advanced algorithm, resulting in the partial cancelation of the gain in 
the speed achieved by advanced algorithms. 

Wc have not discussed the "hyper-systolic" algorithm(Lippert et al. 1998). 
We will briefly discuss them after we introduce our new algorithm in the next 
section. 



3 The new algorithm 

The problem with the traditional schemes is that each processor must receive 
the information of all particles updated at each timestep. This is the case for 
all algorithms described in the previous section. Therefore, the communication 
cost remains constant, while calculation cost decreases as we increase p. 

If all processors keep the complete copies of the system, it is clear that they 
must receive all data updated by other processors at each timestep. If all 
processors only store their own shares, they still have to receive all updated 
particles to calculate forces either to or from them. 

If we divide the processors to subgroups, and let the data be copied only 
within the subgroups, we may be able to reduce the communication cost. In 
the following, we'll discuss how such subdivision works. 
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Fig. 3. Two dimensional array of processors. In this example, a torus network is 
shown. 

Here, we assume that we have p = processors, where r is a positive integer, 
and that they are organized as r x r 2D grid (torus), processors are numbered 
as go,o to Qr-i,r-i- It is easy to extend the new algorithm to any rectangular grid 
with the geometry of r x kr, where k is a positive integer. Such configuration 
can reduce the memory requirement, but increases the communication cost. 
Therefore we do not discuss such non-square grids in this paper. 

We divide N particles to r subsets each with N/r particles, and let processors 
qx,i (here x, i means any index pair with column address i) to have i-th subset. 
Thus, processors qo,Q through g^-^o have the same data of zeroth group. 

The force calculation can be done by the following two steps. First, we apply 
the ring algorithm for each row. However, instead of sending all particles in 
each processor, they send only N/r"^ particles. To do so, within processors in 
the same column the particles are further subdivided to r sub-subgroups. 

After one rotation of the ring is finished, each processor obtained the partial 
force from N/r particles in the ring (row) to N/r particles in the column. 
Calculation cost of this stage is CjN'^/p and communication cost is CcN/r + 
Csr, since we send N/r"^ particles r times. 

In the second stage, we simply take the summation of r partial forces for each 
particles, which are distributed on the r processors in the same column. The 
total force is obtained on one processor, which then broadcasts them to all 
other processors in the same column. The communication cost depends very 
much on the connectivity of the network. In the worst case of 1-D network, the 
communication cost of this stage is CcN. The summation takes r steps in 1-D 
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Fig. 4. The 2D ring algorithm. 

ring, and each processor has to send N/r data. Of course, if we can overlap 
calculation and communication, the scaling can be 2CcN/r + C^r, which is 
far more preferable. With full crossbar networks in this column direction, the 
communication cost becomes 2CcA'" log2 r/r + 2Cs log2 r, assuming that we 
cannot overlap communication and calculation. If we can overlap them, the 
communication cost is reduced to 2CcN/r + 2Cslog2r. In the following, for 
simplicity we assume that the communication cost of this stage is the same as 
that for the first stage. 

The total time per one timestep of this algorithm is 

TsDring = Tf,2Dring + T^aDring = CfN^r'' + 2{CcN/r + C,r). (11) 



For the number of processors p — r'^ for which the efficiency is 50%, we have 
a cubic equation. For the two limiting cases, the solution is given as 



,2DrinsJ 

Phalf,2Dnng ~ | N^Cf/C,)^' {N > Ar,,2Dring) 



(12) 
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Table 2 

Critical values of A'^ for different algorithms 



Network 



,2Dring ^c,2Dbcast ^c,ind,2D 



Fast Ethernet 



300 3 X 10^ 900 



Gigabit Ethernet 



0.3 



90 



200 



Myrinet 



3 



60 



70 



where iVc,2Dring is defined as 




(13) 



C^fCs 



The values of A^c,2Dring for typical networks are given in table 2. Clearly, N is 
almost always larger than A''c,2Dring- This means that the total performance is 
limited by the latency of the network. 

Even so, the number of processors we can use with this 2D algorithm is signif- 
icantly larger than that for ID ring, for any value of A^. If < N^, we can use 
0{N'^) processors. Even if A'" > A^c,2Dring, we can still use 0{N'^^^) processors. 

In this 2D ring algorithm, the 0(r) term in the communication cost limits the 
total performance. We can reduce this term by using the extension of the copy 
algorithm to 2D. 

Instead of using the ring algorithm in the first stage, processors qa broadcast 
their data to all other processors in the same row. After this broadcast proces- 
sor qij has both group i and group j. Then each processor calculates the force 
on particles they received (group i) from particles they originally have (group 
j). In this scheme, the communication cost is reduced to NCc/r + Cg if the 
network switch supports the broadcast. If the network does not support the 
broadcast, the cost varies between CcN/r + CgV for the case of a ring network 
and CcN/r -\- Cg \og^ r for a full crossbar. 

In the second stage, summation is now taken over the processors in the same 
row. Here, result for row i must be obtained on processor qa, which then 
broadcasts the forces to all processors in the same column. After this broad- 
cast, all processors have the forces on all particles in them. They can then use 
this forces to integrate the orbits of particles. 

In this algorithm, the time integration calculation is duplicated over r pro- 
cessors in the same column, but in most cases this does not matter. One 
alternative is that processor qa performs the time integration and broadcasts 
the updated data of particles to other processor in the same column. Yet an- 
other possibility is to let each of r processor to integrate N/r'^ particles, which 
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Fig. 5. The 2D broadcast algorithm. 

each of them then broadcasts within the column. Which approach is the best 
depends on the ratio between calculation speed, communication speed and 
communication startup overhead. 

The total time per one timestep of this algorithm is 

TsDbcast = Tf,2Dbcast + Te,2Dbcast = CfN^T^ + 2{CcN/r + log^ r). (14) 



For the number of processors p = for which the efficiency is 50%, we have 
a cubic equation. For the two limiting cases, the solution is given as 

J {NCf/2C,f (N < 

,2DbcastJ; /ii-N 

where A^c,2Dbcast is defined as 

-/Vc,2Dbcast = y ^ exp Mog2 • (1^) 
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The critical value of A^, A''c,2Dbcast , is larger than that for the 2D ring version 
of the algorithm. This is because we reduced the C^r term in the communica- 
tion cost to Cglogr. More importantly, even for N > iVc_2Dbcast) Phaif is only 
logarithmically smaller than 0{N'^). Thus, with this broadcast version of the 
algorithm we can really use 0{N^) processors and still achieve high efficiency. 

3.1 Relation with the hyper-systolic algorithm 

Now the relation between our algorithm and the hyper-systolic algorithm 
(Lippert et al. 1998) must be obvious. The "regular bases" version of the 
hyper-systolic algorithm applied to processors works in the essentially the 
same way as the ring version of our algorithm works, though in order to de- 
rive our algorithm we do not need to use any complex concepts like /i-range 
problem or Additive Number Theory. To put things in a slightly different way, 
the hyper-systolic algorithm is a complex way to reconstruct combination of 
rowwize ring and columwize summation on a 2D network by a sequence of 
shift operations in 1-D ring network. 

Thus, as far as the C^r term is small, our algorithm and the hyper-systolic 
algorithm show the same scaling. However, since C^r term would almost always 
limit the scaling, the broadcast version of our algorithm is almost always better 
than the hyper-systolic algorithm. In addition, our algorithm is by far easier 
to understand and implement. 

This simplicity of our algorithm makes it possible to extend our algorithm to 
the individual timestep scheme and even to the Ahmad-Cohen scheme, which 
will be discussed in the following sections. 



4 Application to Individual timestep 

4-1 Standard individual timestep 

If we use the broadcast version as the base, the extension to the individual 
timestep method is trivial. Instead of broadcasting all particles in the first 
stage, we broadcast only the particles in the current block. In the following 
steps, we always send only data related with the particles in the current block. 
Everything else is the same as in the case of the shared timestep algorithm. 
Using the same assumption of n ~ N^/^, we have 

ri„d,2D = CfN'^yr^ + 2{C,N^/yr + C, log^ r), (17) 
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and 



{{NCf/2C,Y (A^<A^c,ind,2D), 
Phalf,ind,2D | N^/^Cf/Cs \og^{N^/^C f /2Cs) (N > 7Ve,ind,2D), 



(18) 



where A^c,ind,2D is now given by the following implicit equation 

Arl/3 1 I ^c,ind,2DC'/\ 4C^ , , 

A^c,U2Dl0g2^^ (19) 



For the example values in table 1, the value of A'c,ind,2D is fairly small. So 
we can use only 0{N^/^) processors. However, this is still much larger than 
the number of processors that can be used with ID implementation of the 
individual timestep algorithm. 

The same load-balance problem as we have discussed in the case of the copy 
algorithm occurs with this method. We need some load-balancing strategy to 
actually use this method. 



4 -2 Application to the Ahmad- Cohen scheme 



The difference from the individual timestep scheme is that the neighbor list is 
created/used to calculate the forces, the neighbor hst for forces from particles 
in group j to particles in group i is created, stored and used only by processor 

Qij. Therefore, there is no increase in the communication cost, except for the 
summation of the number of neighbors. The total calculation time and the 
50% efficiency processor count are given by: 

Tac,2D = CfN^"'^/r^ + 2{C,N^'^/r + C, log^ r) , (20) 



and 

Phalf,AC,2D 



{N^'\Cf/2C,f (iV<iVc,AC,2D), .211 

\7ViVi2c^/C,log,(AriVi2c^/2C,) (7V>7Vc,ac,2d), ^ ^ 



where A^c,ac,2D is now given by the following implicit equation 



<AC,2Bl0g2|^^!^li^l=|^. (22) 



2Cs I CfC,^ 



Note that, in this case, whether or not N > Ai'c,AC,2D makes very small dif- 
ference for the number of processors, since the difference is only of the order 
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Fig. 6. Efficiencies of various algorithms discussed in the text. Thin curves corre- 
spond to ID algorithms and thick curves to 2D algorithms. Solid, dashed and dotted 
curves represent shared timestep, individual timestep and Ahmad-Cohen scheme, 
respectively. Number of particles is 10^ and a Myrinet-like network is assumed. 

of {N / Nc^ac,2dY^^'^ ■ Thus, practically we can say that we can use 0{N^^'^) 
processors with the 2D version of the Ahmad-Cohen scheme. 



5 Compctrison of traditional and proposed algorithms 

In this section, we present the theoretical comparison of the proposed algo- 
rithm and the traditional one- dimensional algorithm. First we show the result 
for the case of Myrinet-like fast network. 

Figures 6 to 8 show the efficiencies for three different values of as the 
function of the number of processors p. It is clear that 2D algorithms allow us 
to use much larger number of processors compared to their ID counterparts. 
The gain is larger for larger A^, but becomes smaller as we use more advanced 
algorithms. The gain for individual timestep is smaller than that for shared 
timestep, and that for the Ahmad-Cohen scheme is even smaller. 

Even so, the gain in the processor count is more than a factor of 5, for the case 
of the Ahmad-Cohen scheme and A^ = 10^. We believe this is quite a large 
gain in the parallel efficiency. 

Figure 9 shows the efficiencies for the case of the Fast Ethernet. With the 
2D algorithm we can use more than 500 processors even with Ahmad Co- 
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Fig. 7. Same as figure 6 but for N = 10^. 
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Fig. 8. Same as figure 6 but for N = 10^. 



hen scheme, for = 10^. PC clusters with inexpensive networks seem to be 
very attractive platforms to implement parallel version of the Ahmad-Cohen 
scheme. 
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Fig. 9. Same as figure 8 but for Fast Ethernet. 



6 Combination with GRAPE 



The only thing GRAPE does is to greatly reduce the value of C/. Thus, the 
same 2D network of processors each with one GRAPE processor works fine, 
if the cost of the frontend is less than that of a GRAPE processor. 

GRAPE-6 achieves essentially the same effect as this 2D processor grid, but 
using only r hosts and GRAPE processors connected with a rather elaborate 
multistage networks. In hindsight, such an elaborate network is unnecessary, 
if the fast network is available for a low cost. 

From the point of view of the scaling relations, what a GRAPE hardware 
changes is simply C/. If we attach a ITfiops GRAPE hardware to a 1 Gflops 
host, we reduce C/ by a factor of 10^. This means that the limiting factor for 
the number of processors is almost always Cc and not Cg- Thus, for a par- 
allel GRAPE system, high-throughput, high-latency network such as Gigabit 
Ethernet is a practical choice. 

Figures 10 and 11 shows the efficiencies for GRAPE-6 system. Since C/ is 
much smaller, the number of processors we can use becomes much smaller. 
Of course, in the case of ID algorithms, the actual speed we can achieve does 
not depend on C/, since the total speed is p/Cj and phaif is proportional to 
Cf. Figure 11 indicates that we can achieve the speed of multiple pctafiops 
with currently available technology, by configuring several thousand GRAPE-6 
boards into a 2D network. 
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Fig. 10. Same as figure 8 but for C/ = 3 x 10"^^ 
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Fig. 11. Same as figure 10 but for N = 10^. 

7 Summciry 

We described a new two-dimensional algorithm to implement the direct sum- 
mation method on distributed-memory parallel computers. The basic idea of 
the new algorithm is to organize processors to r x r 2D network, and let the 
data be shared both rowwize and columnwize. In this way, we can reduce the 
communication cost from 0{N) of the previously know algorithms to 0{N/r). 
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For the case of the shared timestep algorithm, the new algorithm behaves 
in essentially the same as the "regular bases" version of the hyper-systolic 
algorithm does. However, with the broadcast version of our algorithm the 
communication overhead is reduced, which resulted in the better scaling. Also, 
our algorithm is much simpler, which helped us to extend our algorithm to 
individual timestep and the Ahmad-Cohen scheme, as well as combination 
with GRAPE hardwares. 

For all cases, compared to the previously known algorithm, the number of 
processors we can use without losing efficiency is almost squared. This is a 
quite large improvement in the efficiency of a parallel algorithm. 

Usually, a paper which proposes a new parallel algorithm should offer the 
verification of the concept, by means of the timing measurement of actual 
implementation. In this paper we omit this verification, because we believe 
it's important to let those who working on hyper-systolic algorithms be aware 
of simpler alternatives. 

In this paper, we assumed a network with full connectivity. This assumption 
is okay with small PC clusters, but not true on large MPPs. In this case, 
parallel efficiency of ID algorithm is significantly reduced, and relative gain of 
2D algorithm would become much larger. 



Acknowledgments 

I thank Rainer Spurzcm and Yoko Funato for valuable discussions. This work 
is supported in part by the Research for the Future Program of Japan Society 
for the Promotion of Science (JSPS-RFTP97P01102). 



References 

Ahmad, A. &; Cohen, L., 1973, Journal of Computational Physics, 12, 389 

Fox, G. C, Williams, R. D., & Messina, P. C, 1994, Parallel Computing Works!, 
Morgan Kaufmann, San Francisco. 

Lippert, T., Seyfried, A., Bode, A., & Schilling, K., 1998, IEEE Transactions on 
Parallel and Distributed Systems, 9, 97 

Makino, J. & Hut, P., 1988, ApJS, 68, 833 

Makino, J. &: Taiji, M., 1988, Scientific Simulations with Special-Purpose Computers 
— The GRAPE Systems, John Wiley and Sons, Chichester. 



19 



Makino, J., 1991, PASJ, 43, 859 



McMillan, S. L. W., 1986, The vectorization of small-n integrators, in edited by 
P. Hut & S. McMillan (Eds), The Use of Supercomputers in Stellar Dynamics, 
Springer, New York, pp. 156-161. 

Sterling, T. L., Salmon, J., Becker, D. J., & Savarcsc, D. F., 1999, How to Build a 
Beowulf: A Guide to Implementation and Application of PC Clusters, MIT Press, 
Cambridge. 

Sugimoto, D., Chikada, Y., Makino, J., Ito, T., Ebisuzaki, T., &Umemura, M., 1990, 
Nature, 345, 33 



20 



